Locality with parallel hierarchical copying garbage collection

ABSTRACT

Disclosed is a garbage collection algorithm that achieves hierarchical copy order with parallel garbage collection threads. More specifically, the present invention provides a garbage collection method and system for copying objects from a from-space to a to-space. The method comprises the steps of (a) having multiple threads that simultaneously perform work for garbage collection (GC), (b) examining the placement of objects on blocks, and (c) changing the placement of objects on blocks based on step (b). Preferably, the method includes the additional step of calculating a placement of object(s) based on step (b), and using the result of the calculation for step (c). For example, the calculation may be used to increase the frequency of intra-block pointers and/or to increase the frequency of siblings on the same block.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to automatic memory management, and more specifically, the invention relates to methods and systems for copying garbage collection.

2. Background Art

In operation, computer programs spend a lot of time stalled in cache and TLB misses, because computation tends to be faster than memory access. For example, Adl-Tabatabai et al. report that the SPECjbb2000 benchmark spends 45% of its time stalled in misses on an Itanium processor [Ali-Reza Adl-Tabatabai, Richard L. Hudson, Mauricio J. Serrano, and Sreenivas Subramoney. Prefetch injection based on hardware monitoring and object metadata. In Programming Language Design and Implementation (PLDI), 2004]. Better locality reduces misses, and thus improves performance. For example, techniques like prefetching or cache-aware memory allocation improve locality, and can significantly speed up the performance of a program.

Locality is in part determined by the order of heap objects in memory. If two objects reside on the same cache line or page, then an access to one causes the system to fetch this cache line or page. A subsequent access to the other object is fast. Copying garbage collection (GC) can change the order of objects in memory. To improve locality, copying GC should strive to colocate related objects on the same cache line or page.

Copying GC traverses the graph of heap objects, copies objects when it reaches them, and recycles memory of unreachable objects afterwards. Consider copying a binary tree of objects, where each cache line can hold three objects. When the traversal uses a FIFO queue, the order is breadth-first and results in the cache line layout in FIG. 1A. When the traversal uses a LIFO stack, the order is depth-first and results in the cache line layout in FIG. 1B. In both cases, most cache lines hold unconnected objects. For example, breadth-first order colocates o₁₀ and o₁₁ with o₁₂, even though o₁₂ will usually not be accessed together with o₁₀ or o₁₁.
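The effect of the two traversal orders can be illustrated with a minimal sketch. It assumes a complete binary tree of 15 objects in which the children of object i are objects 2i and 2i+1, three objects per cache line, and, as a simplification, that an object is copied when it is taken from the work list. The names copy_order and cache_lines are hypothetical, and the depth-first layout need not match FIG. 1B exactly.

```python
from collections import deque

def children(i, n):
    """Children of object i in a complete binary tree with objects 1..n."""
    return [c for c in (2 * i, 2 * i + 1) if c <= n]

def copy_order(n, fifo):
    """Order in which objects leave the work list (FIFO queue or LIFO stack)."""
    work = deque([1])
    order = []
    while work:
        node = work.popleft() if fifo else work.pop()
        order.append(node)
        kids = children(node, n)
        # For the stack, push in reverse so the left child is visited first.
        work.extend(kids if fifo else reversed(kids))
    return order

def cache_lines(order, per_line=3):
    """Group a copy order into cache lines holding per_line objects each."""
    return [order[i:i + per_line] for i in range(0, len(order), per_line)]

bfs_lines = cache_lines(copy_order(15, fifo=True))
dfs_lines = cache_lines(copy_order(15, fifo=False))
```

In this sketch the fourth breadth-first line holds objects 10, 11, and 12, which is the colocation of unrelated objects described above.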

Intuitively, it is better if an object occupies the same cache line as its siblings, parents, or children. Hierarchical copy order achieves this (FIG. 1C). Moon invented a hierarchical GC in 1984, and Wilson, Lam, and Moher improved it in 1991 [Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Effective “static-graph” reorganization to improve locality in a garbage-collected system. In Programming Language Design and Implementation (PLDI), 1991], calling it “hierarchical decomposition”. The algorithms by Moon and by Wilson, Lam, and Moher use only a single GC thread. Using multiple parallel GC threads reduces GC cost, and most product GCs today are parallel.

SUMMARY OF THE INVENTION

An object of this invention is to reduce cache and TLB misses by changing the order in which a parallel garbage collector copies heap objects.

Another object of the present invention is to provide a garbage collection algorithm that achieves hierarchical copy order with parallel garbage collection threads.

A further object of this invention is to improve locality with parallel hierarchical copying garbage collection.

Another object of the invention is to provide a garbage collection algorithm that both reduces cache and TLB misses through hierarchical copying and also maintains good scaling on multiprocessors.

These and other objectives are attained with a garbage collection algorithm that achieves hierarchical copy order with parallel garbage collection threads. More specifically, the present invention provides a garbage collection method and system. The term “block” as used herein refers to a cache line or page or other unit of OS and HW support for memory hierarchy.

The preferred embodiment of the invention, described in detail below, reduces cache and TLB misses and, in this way, improves program run time. Also, parallel garbage collection improves scaling on multi-processor machines.

Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A, 1B and 1C illustrate, respectively, a breadth-first copy order, a depth-first copy order, and a hierarchical copy order.

FIG. 2 is a block diagram illustrating a computer system that may be used in the practice of the present invention.

FIG. 3 is a more detailed block diagram showing a program memory of the computer system of FIG. 2.

FIGS. 4-9 show prior art garbage collection copying procedures.

FIG. 10 shows the possible states of a block in to-space in accordance with a preferred embodiment of the present invention.

FIG. 11 illustrates how the present invention scales in multi-processor systems.

FIGS. 12a-12c show the throughput of this invention on three hardware platforms.

FIGS. 13a-13f show garbage collection scaling for various benchmarks.

FIGS. 14a-14f show the run times of two representative benchmarks.

FIGS. 15a-15f illustrate the low cache and TLB misses obtained using the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In accordance with the present invention, a garbage collection algorithm is provided that achieves hierarchical copy order with parallel garbage collection threads. FIGS. 2 and 3 illustrate, as an example, one suitable computer system in which the present invention may be used. This computer system 100, according to the present example, includes a controller/processor 102, which processes instructions, performs calculations, and manages the flow of information through the computer system 100. Additionally, the controller/processor 102 is communicatively coupled with program memory 104. Included within program memory 104 are a garbage collector 106, operating system platform 110, Java Programming Language 112, Java Virtual Machine 114, glue software 116, a memory allocator 202, Java application 204, a compiler 206, and a type profiler 208. It should be noted that while the present invention is demonstrated using the Java Programming Language, it would be obvious to those of ordinary skill in the art, in view of the present discussion, that alternative embodiments of the invention are not limited to a particular computer programming language.

The operating system platform 110 manages resources, such as the data stored in data memory 120, the scheduling of tasks, and processes the operation of the garbage collector 106 in the program memory 104. The operating system platform 110 also manages a graphical display interface (not shown) that directs output to a monitor 122 having a display screen 124, a user input interface (not shown) that receives inputs from the keyboard 126 and the mouse 130, and communication network interfaces (not shown) for communicating with a network link (not shown). Additionally, the operating system platform 110 also manages many other basic tasks of the computer system 100 in a manner well known to those of ordinary skill in the art.

Glue software 116 may include drivers, stacks, and low-level application programming interfaces (APIs) and provides basic functional components for use by the operating system platform 110 and by compatible applications that run on the operating system platform for managing communications with resources and processes in the computing system 100.

Each computer system 100 may include, inter alia, one or more computers and at least a computer readable medium 132. The computers preferably include means 134 for reading and/or writing to the computer readable medium 132. The computer readable medium 132 allows a computer system 100 to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as floppy disk, ROM, Flash memory, disk drive memory, CD-ROM, and other permanent storage. It is useful, for example, for transporting information, such as data and computer instructions, between computer systems.

The present invention, as mentioned above, provides a garbage collection algorithm that achieves hierarchical copy order with parallel garbage collection threads. The prior art has not been able to achieve this. In order to best understand the significance and advantages of the present invention, several prior art garbage collection algorithms, shown in FIGS. 4-9, are discussed below.

FIG. 4 illustrates Cheney's copying GC algorithm [C. J. Cheney. A nonrecursive list compacting algorithm. Communications of the ACM (CACM), 13(11), 1970]. Memory has two semi-spaces, from-space and to-space. At GC start, all heap objects are in from-space, and all of to-space is empty. GC first scans the program variables for pointers to heap objects, and copies their target objects from from-space to to-space. Copied objects are gray, and a “free” pointer keeps track of the boundary between gray objects and the empty part of to-space. Next, GC scans copied objects for pointers to from-space, and copies their target objects to to-space. Scanned objects are black, and a “scan” pointer keeps track of the boundary between black objects and gray objects. When the scan pointer catches up to the free pointer, GC has copied all heap objects that are transitively reachable from the program variables. From-space is discarded, and the program continues, using the objects in to-space.

Cheney's algorithm copies in breadth-first order (see FIG. 1A), because it scans gray objects first-in-first-out. One advantage of Cheney's algorithm is that it requires no separate stack or queue to keep track of its progress, saving space and keeping the implementation simple. Cheney's algorithm uses only one thread for garbage collection; it is not parallel.
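The scan and free pointers described above can be sketched as follows. This is a simplified illustration rather than a production collector: from-space is modeled as a hypothetical dictionary that maps each object name to the list of objects it points to, to-space is a list, the free pointer is the length of that list, and forwarding pointers are kept in a side table.

```python
def cheney_copy(from_space, roots):
    """Copy everything reachable from roots; return to-space in copy order."""
    to_space = []      # copied objects; the "free" pointer is len(to_space)
    forwarded = {}     # forwarding table: from-space object -> to-space index

    def copy(obj):
        if obj not in forwarded:
            forwarded[obj] = len(to_space)
            to_space.append(obj)          # newly copied objects are gray

    for root in roots:
        copy(root)

    scan = 0                              # objects before "scan" are black
    while scan < len(to_space):           # stop when scan catches up to free
        for child in from_space[to_space[scan]]:
            copy(child)
        scan += 1
    return to_space

heap = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
survivors = cheney_copy(heap, ["a"])
```

Here object "e" is unreachable from the root and is therefore never copied, and "d" is copied only once even though two objects point to it.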

Moon modified Cheney's algorithm to improve locality by copying in hierarchical order instead of breadth-first. FIG. 5 illustrates Moon's algorithm [David A. Moon. Garbage collection in a large Lisp system. In LISP and Functional Programming (LFP), 1984]. To-space is now divided into blocks. As before, objects are copied by bumping the free pointer, which separates gray objects from empty space. But instead of just one scan pointer, Moon maintains two scan pointers. The primary scan pointer is always in the same block as the free pointer. For example, in FIG. 5, both the primary scan pointer and the free pointer point into block D.

If there are gray objects at the primary scan pointer, Moon scans them. If the free pointer reaches the next block (for example E), Moon advances the primary scan pointer to the start of that block, even though there may still be gray objects in the previous block (for example D). The secondary scan pointer keeps track of the earliest gray objects (for example, in block B). If the primary scan pointer catches up with the free pointer, Moon scans from the secondary scan pointer, until the primary scan pointer points to gray objects again. If the secondary scan pointer catches up with the free pointer as well, GC is complete.

Moon's algorithm copies objects in hierarchical order. For example, in FIG. 1C, Moon's algorithm first copies o₁ and its children, o₂ and o₃, into the same block. Next, it copies o₄ (the first child of o₂) into a different block. At this point, the block with o₄ has a gray object at the primary scan pointer, so Moon proceeds to copy the children of o₄ into the same block as o₄. Only when it is done with that block does it continue from the primary scan pointer, which still points into o₂.

The mutator is the part of an executing program that is not part of the GC: the user program, and run time system components such as the JIT compiler. Moon's GC is concurrent with the mutator, but there is only one active GC thread at a time; there are no parallel GC threads.

One problem with Moon's algorithm is that it scans objects twice when the secondary scan pointer advances through already black objects (for example in block C in FIG. 5).

Wilson, Lam, and Moher [Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Effective “static-graph” reorganization to improve locality in a garbage-collected system. In Programming Language Design and Implementation (PLDI), 1991] improve Moon's algorithm by avoiding re-scanning of black objects. FIG. 6 illustrates Wilson, Lam, and Moher's algorithm. It keeps track of the scan pointers in all partially scanned blocks. When the block with the free pointer contains gray objects (for example block D), scanning proceeds in that block; otherwise, it proceeds from the earliest block with gray objects (for example block B). The copy order of Wilson, Lam, and Moher's algorithm is identical to that of Moon's algorithm (see FIG. 1C). The hierarchical copying GC algorithm by Wilson, Lam, and Moher is neither parallel nor concurrent.
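The rule "scan in the block with the free pointer if it has gray objects, otherwise in the earliest block with gray objects" can be sketched single-threaded as follows. The sketch makes several assumptions that are not in the cited work: every object occupies one unit of space, blocks hold three objects, gray objects are found by a linear search rather than by per-block scan pointers, and the function name hierarchical_copy is hypothetical. It scans one slot at a time so that it can switch blocks immediately after each copy.

```python
def hierarchical_copy(from_space, root, block_size=3):
    """Return to-space as a list of blocks, copied in hierarchical order."""
    to_space = [root]
    forwarded = {root}
    next_slot = [0]    # per copied object: index of its next unscanned slot

    def first_gray(lo, hi):
        """Index of the first object in [lo, hi) that still has unscanned slots."""
        for i in range(lo, min(hi, len(to_space))):
            if next_slot[i] < len(from_space[to_space[i]]):
                return i
        return None

    while True:
        # Prefer gray objects in the block that holds the free pointer.
        free_block_start = (len(to_space) // block_size) * block_size
        i = first_gray(free_block_start, len(to_space))
        if i is None:
            i = first_gray(0, len(to_space))   # earliest gray object anywhere
        if i is None:                          # no gray objects left: done
            return [to_space[k:k + block_size]
                    for k in range(0, len(to_space), block_size)]
        child = from_space[to_space[i]][next_slot[i]]
        next_slot[i] += 1                      # this slot is now scanned
        if child not in forwarded:
            forwarded.add(child)
            to_space.append(child)
            next_slot.append(0)

# Complete binary tree: the children of object i are objects 2i and 2i+1.
tree = {i: [c for c in (2 * i, 2 * i + 1) if c <= 15] for i in range(1, 16)}
blocks = hierarchical_copy(tree, 1)
```

For this tree the sketch places each parent in the same block as its two children, for example objects 4, 8, and 9 together.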

In 1985, Halstead published the first parallel GC algorithm [Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. Transactions on Programming Languages and Systems (TOPLAS), 7(4), 1985]. It is based on Baker's GC [Henry G. Baker, Jr. List processing in real time on a serial computer. Communications of the ACM (CACM), 21(4), 1978], which is an incremental variant of Cheney's GC [C. J. Cheney. A nonrecursive list compacting algorithm. Communications of the ACM (CACM), 13(11), 1970]. Halstead's GC works on shared-memory multiprocessor machines with uniform access time to the shared memory. The garbage collector works in SIMD (single instruction, multiple data) style: each worker thread performs the same GC loop on different parts of the heap. The mutator may be SIMD or MIMD (multiple instruction, multiple data). As illustrated in FIG. 7, at any given point in time, either GC threads are running or mutator threads are running, but not both. The GC is parallel, but not concurrent.

Halstead's algorithm partitions to-space into n equally sized parts on an n-processor machine. FIG. 8 illustrates the heap organization for n=2. Worker thread i has a scan pointer scan_(i) and a free pointer free_(i), which point to gray objects and empty space in its respective part of to-space. Termination detection is simple: when scan_(i)=free_(i) for all i, then there are no more gray objects to scan anywhere. Since each thread has its own private part of to-space, the threads do not need to synchronize when scanning objects in to-space or allocating memory in to-space. But they do need to synchronize on individual objects in from-space: if two worker threads simultaneously encounter pointers to the same object in from-space, only one of them should copy it and install a forwarding pointer.
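The synchronization on from-space objects can be sketched as follows. This is only an illustration under stated assumptions: a per-object lock stands in for whatever atomic operation a real collector would use to install the forwarding pointer, and the class FromObject and the helper make_copy are hypothetical names.

```python
import threading

class FromObject:
    """A from-space object with a forwarding pointer that is set at most once."""
    def __init__(self, name):
        self.name = name
        self.forwarding = None
        self._lock = threading.Lock()   # stands in for an atomic update

    def forward(self, make_copy):
        """Return the to-space copy, creating it only if no thread has yet."""
        with self._lock:
            if self.forwarding is None:
                self.forwarding = make_copy(self)
            return self.forwarding

copies = []                             # records every copy actually made

def make_copy(obj):
    copies.append(obj.name)
    return ("copy-of", obj.name)

shared = FromObject("x")
results = []

def worker():
    # Each worker has found a pointer to the same from-space object.
    results.append(shared.forward(make_copy))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All four workers obtain the same forwarded copy, and the object is copied exactly once.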

Like Cheney, Halstead has the advantage of requiring no separate queue or stack to keep track of gray objects, because within the part of to-space that belongs to a thread, the objects themselves are laid out contiguously and form an implicit FIFO queue. The algorithm therefore copies in breadth-first order (FIG. 1A). Unfortunately, the static partitioning of to-space into n parts for n processors leads to work imbalance. This imbalance causes two problems: overflow and idleness. Overflow occurs when a worker thread runs out of empty space to copy objects into. Halstead solves this problem by providing additional empty space to worker threads on demand. Idleness occurs when one thread runs out of gray objects to scan while other threads are still busy. Halstead does not address the idleness problem caused by work imbalance.

In 1993, Imai and Tick published the first parallel GC algorithm with load balancing [Akira Imai and Evan Tick. Evaluation of parallel copying garbage collection on a shared-memory multiprocessor. IEEE Transactions on Parallel and Distributed Systems, 4(9), 1993]. Their algorithm extends Halstead's algorithm by over-partitioning: on an n-processor machine, it partitions to-space into m blocks, where m>n.

FIG. 9 illustrates Imai and Tick's GC. Each GC worker thread has one scan block with gray objects to scan, and one copy block with empty space to copy objects into. These blocks may be separate (A and E in Thread 1) or aliased (D in Thread 2). A shared work pool holds blocks currently unused by any thread. When a copy block has no more empty space, it is completely gray or black and gray. The thread puts the copy block into the work pool for future scanning, replacing it with a new empty block. When the scan block has no more gray objects, it is completely black, and thus done for this garbage collection: the thread gets rid of it. Then, the thread checks whether its private copy block has any gray objects. If yes, it aliases the copy block as scan block. Otherwise, it obtains a new scan block from the shared work pool. In addition to having to synchronize on from-space objects like Halstead's algorithm, the algorithm by Imai and Tick also has to synchronize operations on the shared work pool.

The aliasing between copy and scan blocks avoids a possible deadlock where the only blocks with gray objects also have empty space. In addition, it reduces contention on the shared work queue when there are many GC threads. Imai and Tick's GC only checks for an aliasing opportunity when it needs a new scan block because the old scan block is completely black. Imai and Tick evaluated their algorithm on 14 programs written in a logic language. They report parallel speedups of 4.1× to 7.8× on an 8-processor machine. Their metric for speedup is not based on wall-clock time, but rather on GC “work” (number of cells copied plus number of cells scanned); it thus does not capture synchronization overhead or locality effects. The present invention effectively achieves hierarchical copy order with parallel GC threads.

Baseline Garbage Collector

The implementation of parallel hierarchical copying GC is based on the generational GC implemented in IBM's J9 JVM. It uses parallel copying for the young generation and concurrent mark-sweep with occasional stop-the-world compaction for the old generation. This is a popular design point in products throughout the industry. The baseline GC has exactly two generations, and young objects remain in the young generation for a number of birthdays that is adapted online based on measured survival rates. We are only concerned with copying of objects within the young generation or from the young generation to the old generation.

The baseline GC uses Imai and Tick's algorithm for the young generation. To accommodate tenuring, each worker thread manages two copy blocks: one for objects that stay in the young generation, and another for objects that get tenured into the old generation. Either block may be aliased as scan block.

Parallel Hierarchical GC

Parallel hierarchical GC achieves hierarchical copy order by aliasing the copy and scan blocks whenever possible. That way, it usually copies an object into the same block that contains an object that points to it. This is the parallel generalization of the single-threaded algorithm by Wilson, Lam, and Moher that uses the scan pointer in the block with empty space whenever possible. Blocks serve both as the work unit for parallelism and as the decomposition unit for hierarchical copying. It may be noted that the term “block”, as used herein including the claims, refers to a cache line or page or other unit of OS and HW support for memory hierarchy.

FIG. 10 shows the possible states of a block in to-space as circles. Transition labels denote the possible coloring of the block when a GC thread changes its state. Blocks in states freelist, scanlist, and done belong to the shared work pool. No GC thread scans them or copies into them, and thus, their coloring cannot change. Blocks in states copy, scan, and aliased belong to a GC thread.

For example, a copy block must have room to copy objects into; therefore, all incoming transition labels to state copy are at least partially empty. If the copy block has some gray objects and some empty space, then it can serve both as copy block and as scan block simultaneously, and the GC aliases it; therefore, the transition from state copy to state aliased is labeled with colorings that include both gray and empty. The state machine in FIG. 10 is non-deterministic: the state and coloring of a block alone do not determine which transition it takes. Rather, the transitions depend on the colorings of both the copy block and the scan block of the worker thread.

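TABLE 1 Transition logic in GC thread. (The coloring glyphs that label the rows and columns were lost in extraction; the column labels follow the description below the table, and unrecoverable row labels are marked as such.)

copy coloring              scan = aliased                                    scan has gray slots                 scan completely black
gray and empty             (no action)                                       scan → scanlist; copy → aliased     scan → done; copy → aliased
(glyphs lost)              aliased → copy; scanlist → scan                   (no action)                         scan → done; scanlist → scan
(glyphs lost)              aliased → scan; freelist → copy                   copy → scanlist; freelist → copy    scan → done; copy → scan; freelist → copy
(glyph lost)               aliased → done; freelist → copy; scanlist → scan  (can't happen)                      (can't happen)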

Table 1 shows the actions that the GC thread performs after scanning a slot in an object. For example, if the copy block contains both gray slots and empty space, and the scan block is already aliased with the copy block (column scan=aliased), no action is necessary before the next scanning operation. If the copy block contains gray and black and no empty space, or is completely gray, and the scan block is not aliased, the thread transitions the copy block to the aliased state, and either puts the scan block back on the scanlist if it still has gray slots, or transitions it to the done state if it is completely black.

As described in Table 1, parallel hierarchical GC leads to increased contention on the scanlist. To avoid this, the preferred implementation caches up to one block from the scanlist with each thread. Thus, if there is a cached block, the action scanlist→scan really obtains that cached block instead. Likewise, the transition scan→scanlist really caches the scan block locally, possibly returning the previously cached block to the scanlist in its stead.
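The one-block cache can be sketched as follows. The class and method names (SharedScanlist, WorkerScanCache, release, acquire) are hypothetical, blocks are represented by plain strings, and the access counter exists only to make the saved synchronization visible; this is an illustration of the idea rather than the preferred implementation itself.

```python
import threading
from collections import deque

class SharedScanlist:
    """Shared pool of blocks that still contain gray objects."""
    def __init__(self):
        self._blocks = deque()
        self._lock = threading.Lock()
        self.accesses = 0               # number of synchronized operations

    def put(self, block):
        with self._lock:
            self.accesses += 1
            self._blocks.append(block)

    def take(self):
        with self._lock:
            self.accesses += 1
            return self._blocks.popleft() if self._blocks else None

class WorkerScanCache:
    """Per-thread cache of at most one scanlist block."""
    def __init__(self, scanlist):
        self.scanlist = scanlist
        self.cached = None

    def release(self, block):
        """scan -> scanlist: cache locally, returning any older cached block."""
        previous, self.cached = self.cached, block
        if previous is not None:
            self.scanlist.put(previous)

    def acquire(self):
        """scanlist -> scan: use the cached block if there is one."""
        if self.cached is not None:
            block, self.cached = self.cached, None
            return block
        return self.scanlist.take()

shared = SharedScanlist()
worker = WorkerScanCache(shared)
worker.release("B")
first = worker.acquire()                # served from the cache, no locking
```

In this sketch, a release followed by an acquire on the same thread never touches the shared scanlist, which is the contention the cached block is meant to avoid.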

Presented below is an evaluation of parallel hierarchical copying GC (PH), compared to parallel breadth-first copying GC (BF).

Like Cheney's algorithm and the other Cheney-based algorithms, parallel hierarchical GC requires no separate mark stack or queue of objects. Instead, the gray objects are consecutive in each block, thus serving as a FIFO queue. On the other hand, like Imai and Tick's algorithm, the GC of this invention requires a shared work pool of blocks to coordinate between GC threads. In addition, it requires per-block data to keep track of its state and coloring.

After scanning a gray slot, parallel hierarchical GC checks immediately whether it became possible to alias the copy block and the scan block. Since this check happens in the innermost loop of the GC algorithm, it must be fast. The immediacy of this check is what leads to hierarchical order like in the algorithms by Moon and by Wilson, Lam, and Moher.

The goal of hierarchical copy order is improved mutator locality. But of course, it also affects GC locality and load balancing. This effect can be positive or negative.

As mentioned earlier, in the preferred implementation, each GC thread actually manages two copy blocks, one each for young and old objects. Only one of them can be aliased at a time.

Experimental Setup

Experiments were conducted with a modified version of IBM J2SE 5.0 J9 GA Release (IBM's product JVM), running on real hardware in common desktop and server operating systems. This section discusses the methodology.

The platform for the following four sections was a dual-processor IA32 SMT system running Linux. The machine has two 3.06 GHz Pentium 4 Xeon processors with hyperthreading. The memory hierarchy consists of an 8 KB L1 data cache (4-way associative, 64 Byte cache lines); a 512 KB combined L2 cache (8-way associative, 64 Byte cache lines); a 64 entry data TLB (4 KB pages); and 1 GB of main memory. The platforms for other sections are described there.

TABLE 2 Benchmarks.

Name         Suite     Description                 MB
SPECjbb2005  jbb05     business benchmark          149.3
antlr        DaCapo    parser generator            1.4
banshee      other     XML parser                  84.6
batik        DaCapo    movie renderer              15.0
bloat        DaCapo    bytecode optimizer          11.5
chart        DaCapo    pdf graph plotter           25.0
compress     jvm98     Lempel-Ziv compressor       8.8
db           jvm98     in-memory database          13.6
eclipse      other     development environment     4.8
fop          DaCapo    XSL-FO to pdf converter     8.5
hsqldb       DaCapo    in-memory JDBC database     22.6
ipsixql      Colorado  in-memory XML database      2.5
jack         jvm98     parser generator            1.5
javac        jvm98     java compiler               13.3
javalex      other     lexer generator             1.0
javasrc      Ashes     code cross-reference tool   61.3
jbytemark    other     bytecode-level benchmark    6.5
jess         jvm98     expert shell system         2.3
jpat         Ashes     protein analysis tool       1.1
jython       DaCapo    Python interpreter          2.1
kawa         other     Scheme compiler             3.1
mpegaudio    jvm98     audio file decompressor     1.0
mtrt         jvm98     multi-threaded raytracer    10.4
pmd          DaCapo    source code analyzer        7.0
ps           DaCapo    postscript interpreter      229.3
soot         DaCapo    bytecode analyzer           33.0

Table 2 shows the benchmark suite, consisting of 26 Java programs: SPECjbb2005, the 7 SPECjvm98 programs, the 10 DaCapo benchmarks, 2 Ashes benchmarks, and 6 other big Java programs. Column “MB” gives the minimum heap size in which the program runs without throwing an OutOfMemoryError. The rest of this discussion reports heap sizes as n× this minimum heap size.

All timing numbers herein are relative.

To reduce the effect of noise on the results, all experiments consist of at least 9 runs (JVM process invocations), and usually several iterations (application invocations within one JVM process invocation). For each SPECjvm98 benchmark, a run contains around 10 to 20 iterations at input size 100. Each run of a DaCapo benchmark contains two or more iterations on the largest input.

Speedups

This section shows the effect of hierarchical copying on run time for 25 Java programs. A 26th program, SPECjbb2005, is discussed in more detail below.

TABLE 3 Speedups for all benchmarks except SPECjbb2005.

             % Speedup $\left(1 - \frac{PH}{BF}\right)$ at heap size    C.I.    # GCs
Benchmark    1.33×    2×       4×       10×      (4×)    (10×)
db           +21.9    +22.9    +23.5    +20.5    0.6     40
javasrc      0        +3.5     0        +3.0     2.5     110
mtrt         0        0        0        +3.4     4.6     482
jbytemark    +3.3     0        0        0        1.6     1,761
javac        +2.8     +0.9     +1.6     +3.0     0.5     309
chart        0        +3.0     0        0        3.0     126
jpat         0        0        0        +2.6     0.7     14,737
banshee      0        +2.1     0        0        3.7     6
javalex      +1.0     +1.0     +1.7     +1.6     0.6     201
jython       0        +1.3     0        0        2.3     893
eclipse      0        0        +1.2     0        1.0     9
mpegaudio    0        0        0        +1.0     0.9     15
compress     0        0        0        +1.0     1.8     142
fop          0        0        0        0        1.1     391
hsqldb       0        0        0        0        1.1     239
kawa         0        0        0        0        0.0     13
soot         0        0        0        0        1.1     237
batik        0        0        0        −1.4     0.7     89
jack         0        −1.4     −0.6     0        0.4     1,440
antlr        −1.9     −1.3     −1.0     −1.1     0.9     3,070
jess         −2.8     −2.4     −1.5     0        0.7     3,558
ps           −3.0     −2.7     −2.2     −1.3     0.8     59
bloat        0        −1.7     0        −4.7     1.1     341
pmd          −1.8     0        0        −5.1     3.3     775
ipsixql      −6.0     −6.5     −8.7     −5.9     0.7     3,433

The speedup columns of Table 3 show the percentage by which parallel hierarchical copying (PH) speeds up (+) or slows down (−) run time compared to the baseline parallel breadth-first copying (BF). They are computed as

${1 - \frac{PH}{BF}},$

where PH and BF are the respective total run times. For example, at a heap size of 4× the minimum, parallel hierarchical copying speeds up db's run time by 23.5% compared to breadth-first. When the speedup or slowdown is too small to be statistically significant (based on Student's t-test at 95% confidence), the table shows a “0”. Column “C.I.” shows the confidence intervals for the 4× numbers as a percentage of the mean run time. The confidence intervals at other heap sizes are similar. Finally, Column “#GCs” shows the number of garbage collections in the runs at heap size 10×; smaller heaps cause more garbage collections.
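As a worked illustration of the formula, the following sketch computes the percentage from two run times. The run times used here are hypothetical values chosen only so that the result matches the 23.5% figure quoted for db; they are not measurements, and the function name speedup_percent is an assumption of this sketch.

```python
def speedup_percent(ph_time, bf_time):
    """Percentage speedup of PH over BF, computed as (1 - PH/BF) * 100."""
    return (1.0 - ph_time / bf_time) * 100.0

# Hypothetical total run times in arbitrary units (not measured data).
faster = round(speedup_percent(76.5, 100.0), 1)    # PH finishes sooner
slower = round(speedup_percent(110.0, 100.0), 1)   # PH takes longer
```

A positive result means PH is faster than BF, and a negative result means it is slower, matching the (+) and (−) convention of Table 3.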

None of the benchmarks experienced speedups at some heap sizes and slowdowns at others. The benchmarks are sorted by their maximum speedup or slowdown at any heap size. Out of these 25 programs, 13 speed up, 4 are unaffected, and 8 slow down. The discussion below will show that SPECjbb2005 also speeds up. While speedups vary across heap sizes, we observed no pattern. The program with the largest slowdown is ipsixql, which maintains a software LRU cache of objects. Because the objects in the cache survive long enough to get tenured, but then die, ipsixql requires many collections of the old generation. The program with the largest speedup is db, which experiences similar speedups from depth-first copy order. Depth-first copy order requires a mark stack, hence it is not considered further herein.

Parallel hierarchical copy order speeds up the majority of the benchmarks compared to breadth-first copy order, but slows some down. It may be possible to avoid the slowdowns by deciding the copy order based on runtime feedback.

Mutator vs. Collector Behavior

Parallel hierarchical copying GC tries to speed up the mutator by improving locality. The discussion above showed that most programs speed up, but some slow down. The discussion immediately below explores how mutator and garbage collection contribute to the overall performance.

TABLE 4 Mutator and collector behavior at heap size 4×.

             Mutator                                  Collector
             Time                 TLB misses          Time                 TLB misses
Benchmark    $1 - \frac{PH}{BF}$  BF     PH           $1 - \frac{PH}{BF}$  BF     PH
db           +24.3                7.0    5.5 (−)      −37.6                0.6    0.6 (0)
javasrc      0                    1.0    1.0 (0)      0                    0.6    0.5 (−)
mtrt         0                    2.4    2.5 (0)      −15.4                0.6    0.5 (−)
jbytemark    0                    0.3    0.3 (+)      +9.4                 0.6    0.6 (0)
javac        +2.0                 1.6    1.5 (−)      0                    0.6    0.5 (−)
chart        0                    0.8    0.8 (0)      0                    0.7    0.6 (0)
jpat         0                    2.6    2.7 (0)      0                    0.8    0.8 (0)
banshee      0                    0.4    0.4 (0)      −3.3                 1.0    1.0 (0)
javalex      +1.7                 0.7    1.2 (+)      0                    0.5    0.5 (0)
jython       0                    1.5    1.5 (0)      −9.0                 0.7    0.7 (−)
eclipse      +3.1                 0.9    0.8 (−)      0                    0.7    0.5 (−)
mpegaudio    0                    0.4    0.4 (0)      −5.7                 0.8    0.7 (−)
compress     0                    1.2    1.1 (0)      0                    1.0    1.0 (0)
fop          +1.3                 1.4    1.2 (0)      0                    0.5    0.4 (−)
hsqldb       0                    1.2    1.1 (−)      0                    0.5    0.5 (0)
kawa         +0.4                 1.3    1.3 (0)      −9.6                 0.6    0.5 (−)
soot         0                    1.7    1.7 (0)      −3.9                 0.5    0.5 (0)
batik        0                    0.8    0.8 (0)      0                    0.6    0.6 (0)
jack         0                    1.2    1.2 (0)      −9.2                 0.6    0.4 (−)
antlr        0                    0.8    0.8 (0)      −6.5                 0.6    0.6 (0)
jess         0                    2.1    2.1 (0)      −7.2                 0.5    0.4 (−)
ps           0                    1.3    1.7 (+)      −25.6                0.5    0.4 (−)
bloat        0                    1.2    1.1 (0)      −2.7                 0.6    0.5 (−)
pmd          0                    1.6    1.7 (0)      −13.5                0.6    0.5 (−)
ipsixql      −2.9                 0.8    0.8 (0)      −13.2                0.5    0.4 (−)

Table 4 breaks down the results of running in 4× the minimum heap size into mutator and collector. The “Time” columns show improvement percentages of parallel hierarchical copying (PH) compared to breadth-first (BF); higher numbers are better, negative numbers indicate degradation. The “TLB misses” columns show miss rates per retired instruction, in percent (lower is better; which TLB and other hardware characteristics will be discussed below in more detail). A (+) indicates that PH has a higher miss rate than BF, a (−) indicates that it has a lower miss rate, and a (0) indicates that there is no statistically significant difference. The benchmarks are ordered by the total speedup from Table 3.

When there is a measurable change, with few exceptions, the mutator speeds up and the collector slows down. Even fop and kawa, which experienced no overall speedup, experience a small mutator speedup. Usually, TLB miss rates decrease both in the mutator and in the GC. For the mutator, this explains the speedup; for the GC, this does not prevent the slowdown caused by executing more instructions to achieve hierarchical order. The large reduction in mutator TLB misses for db (from 7% to 5.5%) leads to an overall speedup despite having the largest GC slowdown (of 37.6%). Hierarchical copying only slows down collections of the young generation, but since most objects in db die young, collections of the young generation dominate GC cost.

To conclude, parallel hierarchical copying trades GC slowdown for mutator speedup. This is a reasonable tradeoff as long as GC scaling on multiprocessors is not impacted.

Scaling on Multi-Processor Systems

The discussion herein shows how to achieve hierarchical copy order in a parallel GC. The goal of parallel GC is to scale well in multi-processor systems by using all CPUs for collecting garbage. This is necessary to keep up with the mutator, since it uses all CPUs for allocating memory and generating garbage. The present discussion investigates how well parallel hierarchical copying GC scales.

FIG. 11 shows how the collector scales for SPECjbb2005. SPECjbb2005, the SPEC Java business benchmark, models a server that uses multiple parallel mutator threads to service transactions against a database. For this experiment, the number of mutator threads is fixed at 8, and when the mutator threads are stopped for collection, the GC uses between 1 and 8 threads. The platform is an IA32 Windows system with four 1.6 GHz Pentium 4 Xeon processors with hyperthreading (i.e. 8 logical CPUs), 256 KB of L2 cache, 1 MB of L3 cache, and 2 GB of RAM. The heap size is 1 GB, out of which the young generation uses 384 MB.

All numbers in FIG. 11 are mutator transactions per GC time. Higher numbers indicate that the mutator gets more mileage out of each second spent in GC, indicating better GC scaling. There are curves for parallel breadth-first (BF) copying, parallel hierarchical (PH) copying, and PH with no cached block (PHNCB). All numbers are normalized to BF at 1 thread. The error bars show 95% confidence intervals. With 8 GC worker threads, both BF and PH run around 3 times faster than with 1 thread. Without the cached block optimization from the above section, PH would not scale: it would run 46% slower with 8 threads than with only 1 thread (PHNCB).
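The normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the raw values are invented for the example rather than taken from the measurements behind FIG. 11.

```python
from typing import Dict


def normalize_to_baseline(curve: Dict[int, float], baseline: float) -> Dict[int, float]:
    """Divide every point of a scaling curve by one baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return {threads: value / baseline for threads, value in curve.items()}


# Hypothetical raw data: mutator transactions per second of GC time,
# keyed by the number of GC threads. Illustrative values only.
raw = {
    "BF": {1: 2000.0, 4: 5000.0, 8: 6000.0},
    "PH": {1: 1800.0, 4: 4600.0, 8: 5600.0},
}

# Every curve is expressed relative to BF at 1 thread.
baseline = raw["BF"][1]
normalized = {name: normalize_to_baseline(curve, baseline) for name, curve in raw.items()}

for name, curve in normalized.items():
    print(name, {threads: round(value, 2) for threads, value in curve.items()})
```

Using a single shared baseline, rather than normalizing each collector to its own 1-thread value, keeps the curves directly comparable with each other as well as across thread counts.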

Whereas FIG. 11 shows how SPECjbb2005's GC time scales, FIG. 12 shows how its total throughput scales on three hardware platforms. SPECjbb2005 measures throughput as transactions per second, which should increase with the number of parallel mutator threads (“warehouses”). The three platforms are:

-   a. A 2-processor EM64T system running Linux. The machine has two 3.4 GHz Pentium 4 Xeon processors with hyperthreading, with 1 MB of L2 cache and 4 GB of RAM. On this machine, SPECjbb2005 used a 1 GB heap with a 384 MB young generation.
-   b. The 4-processor IA32 Windows system from FIG. 11.
-   c. An 8-processor Power system running AIX. The machine has eight 1.5 GHz Power 5 processors with hyperthreading, with a total of 144 MB of L3 cache and 16 GB of RAM. On this machine, we ran SPECjbb2005 in a 3.75 GB heap with a 2.5 GB young generation.

In each of the graphs 12 a-c, the x-axis shows the number of warehouses (parallel mutator threads), and the y-axis shows the throughput (transactions per second) relative to the BF throughput with 1 warehouse. Higher is better in these graphs, because it means that more transactions complete per second.

On all three platforms, throughput increases until the number of warehouses reaches the number of logical CPUs, which is twice the number of physical CPUs due to hyperthreading. At that point, parallel hierarchical GC has a 3%, 8%, and 5% higher throughput, respectively, than the baseline GC. Increasing the number of threads further does not increase the throughput, since there are no additional hardware resources to exploit. But hierarchical GC sustains its lead over the baseline GC even as threads are increased beyond the peak.

FIG. 13 shows GC scaling for the SPECjvm98 benchmarks except mpegaudio (which does very little GC). The platform is the same as for FIG. 11, and the heap size is 64 MB. Except for mtrt, all of these programs are single-threaded. Since the amount of mutator work is constant between the different collectors, FIG. 13 measures parallel GC scaling as the inverse of GC time, normalized to GC throughput for BF with 1 thread. For most SPECjvm98 benchmarks, neither PH nor BF scales well. This is in part due to their small memory usage compared to SPECjbb2005: there is not enough work to distribute among the parallel GC worker threads. As for SPECjbb2005, PH with no cached block (PHNCB) scales worse than either PH or BF.

To conclude, parallel hierarchical copying GC scales no worse with increasing load caused by parallel applications than parallel breadth-first copying GC. A single-threaded GC, on the other hand, would have a hard time keeping up with the memory demands of several parallel mutators.

Time-Space Tradeoffs

In a small heap, GC has to run more often, because the application exhausts memory more quickly. This increases the cumulative cost of GC. On the other hand, in a small heap, objects are closer together, which should intuitively improve locality. This section investigates how these competing influences play out.

FIG. 14 shows the run times of two representative benchmarks, SPECjvm98 db and javac, at 6 different heap sizes from 1.33× to 10× (occupancy 75% to 10%). The x-axis shows the heap size; each graph carries labels for absolute heap size at the top and labels for relative heap size at the bottom. The y-axis shows run time relative to the best data point in the graph. In these graphs, lower is better, since it indicates faster run time. There are three graphs for each benchmark, one each for total time, mutator time, and GC time. While the y-axis for total and mutator time goes to 1.5, the y-axis for GC time goes to 3.
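The pairing of relative heap sizes with occupancies above is consistent with occupancy being the reciprocal of the relative heap size. Assuming that definition, which the text does not state explicitly, the two endpoints work out as follows, where $k$ denotes the heap size as a multiple of the minimum heap size:

$$
\text{occupancy} = \frac{\text{minimum heap size}}{\text{heap size}} = \frac{1}{k},
\qquad k = 1.33:\ \frac{1}{1.33} \approx 75\%,
\qquad k = 10:\ \frac{1}{10} = 10\%.
$$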

FIGS. 14 a+d show that parallel hierarchical copying (PH) speeds up the mutator for both db and javac. FIGS. 14 b+e show that, as expected, total GC cost is higher in smaller heaps. But this effect is more significant for javac than for db, because javac has a higher nursery survival rate [Martin Hirzel, Johannes Henkel, Amer Diwan, and Michael Hind. Understanding the connectivity of heap objects. In International Symposium on Memory Management (ISMM), 2002]. That is also the reason why PH slows down the collector for db, while causing no significant change in collector time for javac. The overall behavior of db is dominated by the mutator speedup caused by PH (FIG. 14 c), whereas the overall behavior of javac is dominated by the decrease of GC cost in larger heaps (FIG. 14 f).

This confirms the conclusions from above: parallel hierarchical GC performs well in both small and large heaps.

Cache and TLB Misses

The goal of hierarchical copying is to reduce cache and TLB misses by colocating objects on the same cache line or page. This section uses hardware performance counters to measure the impact of hierarchical copying on misses at different levels of the memory subsystem.

Pentium processors expose hardware performance counters through machine-specific registers (MSRs), and many Linux distributions provide a character device, /dev/cpu/*/msr, to access them. Running modprobe msr ensures the presence of this device; for experiments in user mode, the files must be readable and writable for users. The JVM sets up the registers for collecting the desired hardware events at the beginning of the run, and reads them at the beginning and end of GC, accumulating them separately for the mutator and the GC.
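The following is a minimal sketch of the reading and accumulation scheme just described, not the JVM's actual implementation. It relies on the Linux msr device convention that a register is read as 8 bytes at a file offset equal to the register number. The names read_msr and PhaseCounters are hypothetical, the register number is left as a parameter because the text does not name specific registers, and counter overflow is ignored.

```python
import os
import struct


def read_msr(register: int, cpu: int = 0,
             path_template: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read one 64-bit MSR; the register number is used as the file offset."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError("short read from MSR device")
    return struct.unpack("<Q", data)[0]  # x86 is little-endian


class PhaseCounters:
    """Accumulate an increasing event counter separately for mutator and GC."""

    def __init__(self, read_counter):
        self._read = read_counter
        self._last = read_counter()  # snapshot at the beginning of the run
        self.mutator = 0
        self.gc = 0

    def gc_begin(self):
        # Events since the last snapshot happened while the mutator was running.
        now = self._read()
        self.mutator += now - self._last
        self._last = now

    def gc_end(self):
        # Events since the last snapshot happened during garbage collection.
        now = self._read()
        self.gc += now - self._last
        self._last = now


# Demonstration with a fake counter source (hypothetical readings),
# so the sketch runs without access to the MSR device.
readings = iter([100, 150, 180, 260, 300])
counters = PhaseCounters(lambda: next(readings))
counters.gc_begin()
counters.gc_end()
counters.gc_begin()
counters.gc_end()
print(counters.mutator, counters.gc)
```

On a real system, the counter source passed to PhaseCounters would be something like a lambda calling read_msr with the chosen register, after the event-selection registers have been programmed.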

FIG. 15 shows the results. The x-axis shows the heap size; each graph carries labels for absolute heap size at the top and labels for relative heap size at the bottom. The y-axis shows the hardware metric; each graph carries labels for relative miss rate at the left and labels for absolute miss rate at the right. In these graphs, lower is better, since it indicates fewer misses. The denominator of all ratios is retired instructions. See Table 4 for statistical confidence on the TLB miss rates; there are some variations due to noise. The “Bus cycles” event measures for how many cycles the bus between the L2 cache and main memory was active. This indicates L2 misses, for which Pentium 4 does not provide a reliable direct counter. Note that bus clock speeds are usually an order of magnitude lower than processor clock speeds.

Parallel hierarchical copying reduces mutator misses on all measured levels of the memory subsystem: L1 data cache, combined L2 cache, and TLB. It reduces misses for both db and javac, at all heap sizes. As expected, the reduction in TLB misses is the most significant, because the hierarchical GC uses 4 KB blocks as the decomposition unit, which coincides with the page size. With BF, db has high L1 and TLB miss rates, and PH reduces the miss rates significantly. That explains the large speedup that the above sections report for db.

To conclude, parallel hierarchical copying GC reduces TLB misses most, while also reducing L1 and L2 cache misses significantly. These reduced miss rates translate into reduced run time.

Pointer Distances

The above section already demonstrated that hierarchical copying reduces cache and TLB misses. This section validates that it achieves that by colocating objects on the same cache line or page.

For this experiment, the GC records the distance between the address of a pointer and the address of the object it points to just after a copying or forwarding operation. Pointers with an absolute distance under 64 B are classified as “Line”, and pointers with an absolute distance between 64 B and 4 KB are classified as “Page”. The numbers only consider pointers from objects in the young generation to other objects in the young generation, and from newly tenured objects in the old generation to other newly tenured objects in the old generation. Among other things, this disregards pointers between young and old objects; those have longer distances, but are rare, and hierarchical copying cannot colocate them on the same page.
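The classification described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the label “Far” for distances of 4 KB or more is not a category named in the text, and treating exactly 64 B as “Page” is an assumption about the boundary.

```python
from collections import Counter

LINE_BYTES = 64        # cache-line threshold from the text
PAGE_BYTES = 4 * 1024  # page threshold from the text


def classify(pointer_addr: int, target_addr: int) -> str:
    """Classify a pointer by the absolute distance to the object it points to."""
    distance = abs(target_addr - pointer_addr)
    if distance < LINE_BYTES:
        return "Line"
    if distance < PAGE_BYTES:
        return "Page"
    return "Far"  # hypothetical label for everything else


def distance_percentages(pairs):
    """Return the percentage of pointers in each class for (pointer, target) pairs."""
    counts = Counter(classify(p, t) for p, t in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"Line": 0.0, "Page": 0.0, "Far": 0.0}
    return {label: 100.0 * counts[label] / total for label in ("Line", "Page", "Far")}


# Hypothetical (pointer address, target address) pairs for illustration.
sample = [(0, 8), (0, 100), (0, 5000), (0, 32)]
print(distance_percentages(sample))
```

Note that this classifies by distance alone, as the text does; two addresses less than 4 KB apart can still straddle a page boundary.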

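TABLE 5 Pointer distances (percent of pointers classified as “Line” or “Page”).

| Benchmark | BF Line | BF Page | PH Line | PH Page |
|---|---|---|---|---|
| db | 0.0 | 9.4 | 23.6 | 65.1 |
| SPECjbb2005 | 0.0 | 0.5 | 6.8 | 72.4 |
| javasrc | 0.3 | 20.8 | 17.6 | 32.2 |
| mtrt | 2.2 | 28.5 | 24.1 | 46.7 |
| jbytemark | 0.1 | 4.0 | 11.2 | 11.6 |
| javac | 1.2 | 33.9 | 33.1 | 29.1 |
| chart | 0.1 | 4.9 | 58.0 | 5.6 |
| jpat | 0.1 | 7.2 | 46.3 | 5.3 |
| banshee | 0.6 | 28.1 | 15.9 | 46.8 |
| javalex | 1.6 | 19.6 | 21.9 | 17.2 |
| jython | 0.2 | 11.1 | 4.5 | 35.6 |
| eclipse | 1.9 | 25.9 | 28.6 | 37.3 |
| mpegaudio | 0.9 | 33.0 | 16.7 | 50.8 |
| compress | 5.4 | 33.2 | 23.2 | 40.1 |
| fop | 0.2 | 32.8 | 11.9 | 52.0 |
| hsqldb | 0.1 | 28.9 | 20.3 | 64.9 |
| kawa | 3.0 | 28.0 | 23.4 | 32.1 |
| soot | 6.8 | 30.1 | 21.5 | 38.1 |
| batik | 1.4 | 32.9 | 20.5 | 45.5 |
| jack | 0.4 | 35.5 | 26.4 | 49.4 |
| antlr | 2.0 | 32.8 | 20.1 | 44.4 |
| jess | 0.3 | 6.7 | 8.0 | 6.5 |
| ps | 0.1 | 24.1 | 32.2 | 33.9 |
| bloat | 4.0 | 24.5 | 34.5 | 26.5 |
| pmd | 1.7 | 28.5 | 27.4 | 29.0 |
| ipsixql | 1.0 | 20.2 | 32.6 | 21.1 |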

Table 5 shows pointer distances. For example, db with breadth-first copying yields 9.4% of pointers that are longer than 64 bytes but under 4 KB, whereas parallel hierarchical copying improves that to 65.1%. Except for SPECjbb2005, all runs used heaps of 4× the minimum size.

These numbers show that parallel hierarchical copying succeeds in colocating objects on the same 4 KB page for the majority of the pointers. This explains the reduction in TLB misses observed in Table 4. Also, parallel hierarchical copying colocates objects on the same 64-byte cache line much more often than the baseline garbage collector. This explains the noticeable reduction in L1 and L2 cache misses observed above.

While hierarchical copying is tremendously successful at improving spatial locality of connected objects, wall-clock numbers from a real system (Table 3) paint a more sober picture. This discrepancy underlines three points: (i) Hierarchical copying trades GC slowdown for mutator speedup. The result of this tradeoff is determined by the concrete benchmark, GC implementation, and platform. (ii) Hierarchical copying aims at decreasing TLB and cache miss rates. When the application working set is small compared to the memory hierarchy of the machine, miss rates are already so low that decreasing them further helps little. (iii) Hierarchical copying optimizes for the “hierarchical hypothesis” that connectivity predicts affinity. In other words, it assumes that objects with connectivity (parents or siblings in the object graph) also have affinity (the application accesses them together). Not all applications satisfy the hierarchical hypothesis.

It should be noted that the present invention, or aspects of the invention, can be embodied in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

1. A method for doing garbage collection (GC) comprising the steps of: a) having multiple threads that simultaneously perform work for GC; b) examining the placement of objects on blocks; and c) changing the placement of objects on blocks based on step (b).
 2. A method of claim 1 including the additional step of calculating a placement of object(s) based on step (b), and using the result of the calculation for step (c).
 3. A method of claim 2, wherein the calculation increases the frequency of intra-block pointers.
 4. A method of claim 2, wherein the calculation increases the frequency of siblings on the same block.
 5. A method of claim 1, further comprising the step of enabling interrupting the scanning of a block, and deferring it for later, after every GC operation of scanning one slot or copying one object.
 6. A method of claim 5 including the additional step of maintaining a scan pointer to the middle of an object in a block that is not actively being scanned by a thread.
 7. A method for doing garbage collection (GC) that uses the following steps: a) having multiple threads that simultaneously perform work for GC; and b) using blocks as the work unit for parallel copying.
 8. A method of claim 7, comprising the further steps of: d) examining the placement of objects on blocks; and e) changing the placement of objects on blocks based on step (d).
 9. A method according to claim 1, where i) roots and remembered sets get scanned to identify target objects in from-space, and the target objects get copied to to-space; ii) after an object gets copied, its to-space copy eventually gets scanned to identify zero or more target objects it points to in from-space, and the target objects also get copied to to-space; and iii) to-space copies of objects get scanned in a defined order, which is used to achieve changing the placement of the copies of target objects on to-space blocks.
 10. A method according to claim 9, where for each object that has already been copied to a block in to-space, the object is used to identify zero or more target objects it points to in from-space, and the target objects are copied to another block in to-space, the block of the scanned object having a defined relationship with the block of the copied object.
 11. A method according to claim 10, wherein said block of the scanned object is sometimes caused to be the same as said block of the copied object.
 12. A method according to claim 11, wherein the step that uses a defined order includes the step of, when one of the target objects is copied, checking to determine whether the block containing the scanned object is the same as the block to which the target object gets copied.
 13. A method according to claim 12, wherein the step of checking to determine whether the block containing the scanned object is the same as the block to which the target object gets copied, causes the two blocks to be the same if possible.
 14. A system for doing garbage collection (GC) comprising: a) multiple threads that simultaneously perform work for GC; b) means for examining the placement of objects on blocks; and c) means for changing the placement of objects on blocks based on said examining.
 15. A system of claim 14 including the additional means for calculating a placement of object(s) based on said examining, and for using the result of the calculation for said changing.
 16. A system of claim 15, wherein the calculation increases the frequency of intra-block pointers.
 17. A system of claim 15, wherein the calculation increases the frequency of siblings on the same block.
 18. A system according to claim 14, further comprising means for enabling interrupting the scanning of a block, and deferring it for later, after every GC operation of scanning one slot or copying one object.
 19. A system of claim 18 including the additional means of maintaining a scan pointer to the middle of an object in a block that is not actively being scanned by a thread.
 20. A system according to claim 14, where i) roots and remembered sets get scanned to identify target objects in from-space, and the target objects get copied to to-space; ii) after an object gets copied, its to-space copy eventually gets scanned to identify zero or more target objects it points to in from-space, and the target objects also get copied to to-space; and iii) to-space copies of objects get scanned in a defined order, which is used to achieve changing the placement of the copies of target objects on to-space blocks.
 21. A system according to claim 20, where for each object that has already been copied to a block in to-space, the object is used to identify zero or more target objects it points to in from-space, and the target objects are copied to another block in to-space, the block of the scanned object having a defined relationship with the block of the copied object.
 22. A system according to claim 21, wherein said block of the scanned object is sometimes caused to be the same as said block of the copied object.
 23. A system according to claim 22, wherein the means that uses a defined order includes means for, when one of the target objects is copied, checking to determine whether the block containing the scanned object is the same as the block to which the target object gets copied.
 24. A system according to claim 23, wherein the means for checking to determine whether the block containing the scanned object is the same as the block to which the target object gets copied, causes the two blocks to be the same if possible.
 25. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for doing garbage collection (GC), said method steps comprising: a) having multiple threads that simultaneously perform work for GC; b) examining the placement of objects on blocks; and c) changing the placement of objects on blocks based on step (b).
 26. A program storage device of claim 25, wherein said method steps include the additional step of calculating a placement of object(s) based on step (b), and using the result of the calculation for step (c).
 27. A program storage device of claim 26, wherein the calculation increases the frequency of intra-block pointers.
 28. A program storage device of claim 27, wherein the calculation increases the frequency of siblings on the same block.
 29. A program storage device of claim 25, wherein said method steps comprise the additional step of enabling interrupting the scanning of a block, and deferring it for later, after every GC operation of scanning one slot or copying one object.
 30. A program storage device of claim 29, wherein said method steps include the additional step of maintaining a scan pointer to the middle of an object in a block that is not actively being scanned by a thread.
 31. A program storage device according to claim 25, where i) roots and remembered sets get scanned to identify target objects in from-space, and the target objects get copied to to-space ii) after an object gets copied, its to-space copy eventually gets scanned to identify zero or more target objects it points to in from-space, and the target objects also get copied to to-space iii) to-space copies of objects get scanned in a defined order, which is used to achieve changing the placement of the copies of target objects on to-space blocks.
 32. A garbage collection method of copying objects from a from-space to a to-space, comprising the steps of: using a plurality of threads to copy a plurality of said objects simultaneously from said from-space to a plurality of blocks in said to-space; examining the placement of the objects in said blocks; and changing the placement of objects in said blocks based on said examining.
 33. A method according to claim 32, wherein: the using step includes the step of using said multiple threads to scan some of said plurality of blocks; each of said threads scans one of said plurality of blocks at a time; each of said plurality of blocks is, at one time, a copy block; and at least some of said plurality of blocks transition from a copy block to a scan block.
 34. A method according to claim 33, wherein: a first object is copied into a given one of the blocks, and a second object is copied into another one of the blocks, said another one of the blocks having a defined relationship with said given one of the blocks; said given one of the blocks is caused to be the same block as said another one of the blocks; and comprising the further step of, when the second object is copied to said another one of the blocks, checking to determine whether said another one of the copy blocks is the same as said given one of the blocks; and if the said determination is that the said another one of the blocks is not the same as said given one of the blocks, then for said second object, the block into which the object is copied is chosen to be the same block as said given one of the blocks, if possible.